lcc with ASDL and the IR it generates

Example#1: An unitialized global integer
Example#2: Initialized global variables
Example#3: Array, structure and union
Example#4: Local variables
Example#5: Function with parameters and arithmetic
Example#6: Pointer and array
Example#7: Function call
Example#8: if-else if-else
Example#9: switch
Example#10: while loop
Example#11: do loop
Example#12: for loop

All in all...
Things still unknown and undone

What is this all about

This article is meant to give a feel of the intermediate representation generated by lcc so as to make it easier to work upon it for whatever purpose, may it be understanding the compiler or writing optimizers. The way this document is structured is that we take various example C inputs to lcc and try to look at and understand the generated ASDL pickle. I also show (in some cases) the assembly code generated. This is x86 assembly code for win32.

It is pretty much assumed that you have already read about lcc and ASDL but epecially these two papers :

[1] A Code Generation Interface for ANSI C by David R. Hanson and Christopher W. Fraser.
[2]Early Experience with ASDL in lcc by David R. Hanson.

Look at the page for ASDL and Using lcc with ASDL to know how to set up everything.

I have put all the relevant stuff (atleast for the first reading) in the same HTML file, this makes it easier to get a printout.

lcc's ASDL file

Before directly starting with examples of C code and the generated ASDL, I have just pasted the ASDL file here. The explanation is given in "Early Experience with ASDL in lcc by David R. Hanson". Also, the ASDL being so simple to read and understand, this doesn't have to be explained in great detail. You'll get the feel as you go through the examples.

-- lcc IR -- $Id: asdl.nw,v 1.24 1998/09/21 21:05:15 drh Exp $ module rcc { -- Pickles start with an int version number, followed by rcc.program program = (int nuids,int nlabels,item* items,interface* interfaces,int argc,string *argv) real = (int msb,int lsb) item = Symbol(symbol symbol) | Type(type type) attributes(int uid) symbol = (identifier id,int type,int scope,int sclass,int ref,int flags) field = (identifier id,int type,int offset,int bitsize,int lsb) enum = (identifier id,int value) type = INT | UNSIGNED | FLOAT | VOID | POINTER(int type) | ENUM(identifier tag,enum* ids) | STRUCT(identifier tag,field* fields) | UNION(identifier tag,field* fields) | ARRAY(int type) | FUNCTION(int type,int* formals) | CONST(int type) | VOLATILE(int type) attributes(int size,int align) interface = Export(int p) | Import(int p) | Global(int p,int seg) | Local(int uid,symbol p) -- includes formals | Address(int uid,symbol q,int p,int n) | Segment(int seg) | Defaddress(int p) | Deflabel(int label) | Defconst(int suffix,int size,int value) | Defconstf(int size,real value) | Defstring(string s) | Space(int n) | Function(int f,int* caller,int* callee,int ncalls,interface* codelist) | Blockbeg | Blockend | Forest(node* nodes) node = CNST(int value) | CNSTF(real value) | ARG(node left,int len,int align) | ASGN(node left,node right,int len,int align) | CVT(int op,node left,int fromsize) | CALL(node left,int type) | CALLB(node left,node right,int type) | RET | ADDRG(int uid) | ADDRL(int uid) | ADDRF(int uid) | Unary(int op,node left) -- INDIR RET JUMP NEG BCOM | Binary(int op,node left,node right) -- ADD SUB DIV MUL MOD BOR BAND BXOR RSH LSH | Compare(int op,node left,node right,int label) -- EQ NE GT GE LE LT | LABEL(int label) | BRANCH(int label) | CSE(int uid,node node) attributes(int suffix,int size) }

If you notice, a lot of types above have "uid"s. uids (stands for unique ids), are used in the lcc ASDL description to capture the cyclic graph-like structure of the lcc IR. One limitation of ASDL is that it can describe only trees. But in the lcc IR we need different objects to point to others and it is not strictly a tree. This effect is achieved by having uids in the ASDL description of the lcc IR. Since uids are unique by definition, the uid field will tell you which is the IR object that is being referred to.

Examples

Okay, lets get started with the examples. Before explaining any example, there'll be three links. One to the C input given to lcc, one to the pretty printed HTML ASDL, and the third to the corresponding win32 assembly code. However, its not necessary to follow these links (atleast not in the first reading). I have tried to have all the relevant material in this document. The complete inputs and outputs are only for completeness.

Example#1: An unitialized global integer

The C code
The generated IR
The corresponding assembly

Lets see whats the ASDL generated for a simple one line C file that declares one uninitialized global integer called "i".

The C input

/*An untitialized global int */ int i;

We generate the ASDL pickle for this input and pretty print it to HTML using 2html.exe which can be built with lcc.

This pretty printed ASDL pickle is given next:

The Generated IR


nuids = 3
nlabels = 1
items = item list
Type : item
uid = 2
type = INT : type
size = 4
align = 4




Symbol : item
uid = 1
symbol = symbol
id = i
type = 2
scope = GLOBAL
sclass = AUTO
ref = 0
flags = 0






interfaces = interface list
Segment : interface
seg = BSS


Export : interface
p = 1


Global : interface
p = 1
seg = BSS


Space : interface
n = 4




argc = 5
argv = string list
24,'E:\APPS\LCC\BIN\rcc.exe'
18,'-target=x86/win32'
6,'-asdl'
46,'C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\lcc26680.i'
7,'f1.pkl'

The corresponding assembly code



.486

.model flat

extrn __fltused:near

extrn __ftol:near

_DATA segment

public _i

align 4

_i label byte

db 4 dup (0)

_DATA ends

end

Since this is the first example, I have given and will explain this entire pickle. For later examples, I'll neither give nor explain the entire pickle but just show and explain the parts that I think are relevant.

Now lets look at the ASDL. An lcc compilation unit - the C file is represented by the type "program" in the ASDL description. As seen in the lcc ASDL file the type program was program = (int nuids,int nlabels,item* items,interface* interfaces,int argc,string *argv)

If you look at the first two lines of the generated IR, we see nuids = 3 and nlabels = 1. This tells the backend, or optimizer or whoever is using this IR, that there are 3 uids (but I see only two!) in the generated IR. This leaves us with three other fields of program - the list of items, the list of interfaces, argc, and argv.

The list of items is basically a dump of lcc's internal symbol table. As seen in lcc IR's ASDL description, an item is a symbol or a type, both of which typically go into the symbol table. As for argc and argv they are just passed through from rcc and contain rcc's command line options to be passed to pass2, which is the backend that'll read the pickle and generate assembly code.

Lets look at the two items, in the item list that lcc has generated for our C input.

items = item list
Type : item
uid = 2
type = INT : type
size = 4
align = 4




Symbol : item
uid = 1
symbol = symbol
id = i
type = 2
scope = GLOBAL
sclass = AUTO
ref = 0
flags = 0

The first is the type int. It is an item which is a type (not a symbol). Its uid is 2, its size if 4 bytes, and the alignment is 4.

The second is the symbol information for the variable i. It has uid 1. The id field (declared with type identifier in the ASDL description), has the name "i". The next field "type" field has an integer which will contain the uid for the type of this variable. In this case it is 2, which tells us that the uid of the type of i is 2, meaning that the variable is of type int. The scope tells you that this is a global variable. The values that scope can have are GLOBAL, PARAM and LOCAL+k (where k tells the nesting level of the block where this variable is declared). Next comes the storage class, which in this case is AUTO. The other values for storage class are STATIC, AUTO, EXTERN, and REGISTER. An interesting point to note here is that if "i" had been declared as static int i , instead of int i, we'd have had the scope still GLOBAL, but the storage class STATIC. This means that the variable will still be considered a global variable, but the storage class will tell the back end not to export the variable outside this object file. The field ref stores how many times this symbol is mentioned (actually referenced, the declaration does not count as a ref) in the code. What the next field "flags" is I am not sure yet.

Now look at the interfaces. The represent the functions of the backend that would've been called by the front end, if there was no ASDL enabled and the lcc front end and backend were not separated.

The first interface is the backend's segment function. Segment : interface seg = BSS
The segment in this case is BSS. This tells the backend that the next interface calls pertain to BSS. BSS stands for the uninitialized variables. Often, the code generators use the same segment "data" for both BSS and DATA. This call is what generated the line. _DATA segment
in the assembly code.

The next interface is Export : interface p = 1
the "p" here, is the uid, which in this case is the uid for the variable i. The export interface in the backend looked at the name of the variable with uid = 1, and found out that it was "i" and hence generated the line
public _i
in the assembly code.

The next interface is Global : interface p = 1 seg = BSS
This was used by the backend for its interface function "global". The backend knows that the variable with uid = 1, is a global variable. It goes checks and finds that its in this same segment (BSS) and that its size is 4 and so is its alignment. Hence it generates the assembly code align 4 _i label byte

The last interface call is Space : interface n = 4
This call tells the backend that it has to allocate space of 4 bytes (n=4) and initialize it to zero. Hence it generates the assembly code db 4 dup (0)

The other lines of assembly code (like the very beginning and the end of the segment) were automatically generated by the code generating backend.

Example#2: Initialized global variables

The C code
The generated IR
The corresponding assembly

This example is for us to understand how initialized global variables are dealt with.

The C input

/* an initialized static float */ static float f = 12.2; /* an initialized character pointer */ const char *pch = (char *)0x1234;

An interesting thing in the code above, is that f is a static variable, and that pch has a pretty complicated type. Lets see how all this is represented by the lcc IR. As mentioned before, I won't give the entire IR here, but just parts of interest.

For one, we see that "sclass" for the symbol f is not "STATIC" and not AUTO". We also see that there is no "Export" in the interface list for the symbol f, but there is one for the symbol pch. This of course, is because f is a static variable, while pch is global.

Also, look at how values are assigned, so to speak. The interface "Global" is immediately followed by the interface to define the constant value. Lets see how this is done for the symbol f. Here's the generated interface for f.

Segment : interface seg = DATA Global : interface p = 1 seg = DATA Defconstf : interface size = 4 value = (0X60000000,0X40286666) = 12.2

The line of interest is the Defconstf - define a floating point constant. As seen in the ASDL for the lcc IR, a floating point value is represented by two integers (since there is no type for floating point numbers in ASDL). Also see that it immediately follows the interface "Global". This ensures that the assembler knows that the label "_f" refers to that place in the data segment where the value 12.2 is stored.

It now remains to look at how const char * is represented in the IR.

Type : item uid = 6 type = INT : type size = 1 align = 1 Type : item uid = 5 type = CONST : type size = 1 align = 1 type = 6 Type : item uid = 4 type = POINTER : type size = 4 align = 4 type = 5

You see it all above, and there's not much to explain. The type const char * is captured in three items. One is a POINTER, which then refers to CONST, which in turn refers to INT. Notice that there's no such type called char in the IR. We have INT with size = 1, and align = 1.

The assembly code generated for this exampleis straight forward and easy to understand.

Example#3: Array, structure and union

The C code
The generated IR
The corresponding assembly

This example shows how arrays, structures and unions are represented. The only things to look at in this example are the type items. Other things like symbols and names are not different.

The C input



/* an array */

double chs[10];



/* structure declaraton */

struct t_s

{

     struct t_s* pts;

     int int1;

}s;



/* union declaration */

union t_u

{

     union t_u* pu;

     int int2;

}u;



/* typedef for pointer to that structure */

typedef struct t_s *pt_s;

This is how the array type is represented. Type : item uid = 10 type = FLOAT : type size = 8 align = 4 Type : item uid = 9 type = ARRAY : type size = 80 align = 4 type = 10

So you see there's an item called ARRAY which refers to the FLOAT item. The size of array is 80 bytes. Also note how double is represeented as a FLOAT with size 8.

This is how the structure is represented

Type : item
uid = 7
type = STRUCT : type
size = 8
align = 4
tag = t_s
fields = field list
field
id = pts
type = 8
offset = 0
bitsize = 0
lsb = 0


field
id = int1
type = 6
offset = 4
bitsize = 0
lsb = 0

Where the types referred to are Type : item uid = 8 type = POINTER : type size = 4 align = 4 type = 7 Type : item uid = 6 type = INT : type size = 4 align = 4

So we see the use of the type POINTER, and it refers to 7, which is the type for the struct. We also see how the fields in a structure are represented. The offsets are 0 and 4 as expected. The bitsize field is for use when the field is defined in terms of the number of bits. The 'lsb' field is used when some field is defined in terms of bits. For example look at, struct s { int a:2; int b:1; char c; };
Here, the lsb field for 'a' will have value 1 and 'b' wil have 3. So, lsb is the bit offset for a field in the byte. Of course, the bits won't be in the same byte if the definitions of the bit-size field are not consecutive.

Unions are represented exactly like structs are except, of course, that there'll be UNION in type and not STRUCT.

Also notice that the typedef never appears in the IR. It has been, in some sense, handled like a preprocessor directive. So all instances of pt_s would be replaced by t_s *, and there's no need to have typedefs after semantic checking is done by the front end.

Example#4: Local variables

The C code
The generated IR
The corresponding assembly

Now that we've seen how global and file static variables are handled, we move on to local variables. Local variables exist on stack (unless declared static), only for that particular activation of the function. Hence, there will be code to make space for them on stack, and to assign values to them.

The C input



void foo()

{

     int p = 7;

     {

          char p = (char)p;

          static int n;

     }

}

Notice that there are two blocks in this input and the same variable p has been declared at different levels. Also, we have declared n as static.

Since the two variables p are local, there is no global symbol entry for them in the IR. They will exist in the local symbol table in the "local" interface, given in the ASDL description as Local(int uid,symbol p) -- includes formals

Therefore there'll be only two global symbols in the IR now - one for foo and another for n. Symbol : item uid = 1 symbol = symbol id = foo type = 7 scope = GLOBAL sclass = AUTO ref = 0 flags = 0 Symbol : item uid = 2 symbol = symbol id = n type = 3 scope = LOCAL+1 sclass = STATIC ref = 0 flags = 0

Notice the LOCAL + 1 for the scope of n. We also had a first look at how the symbols for function names are represented.

This is the first example, where we see some code - the first use of the code segment. Also the first example with the function interface. Here's the specification of the function interface in lcc's ASDL description. Function(int f,int* caller,int* callee,int ncalls,interface* codelist)

Now lets look at what this function interface contains for our function foo.

Function : interface


f = 1
caller = empty int list

callee = empty int list

ncalls = 0

The first three fields are straightforward. f points to the symbol uid for this function which is uid 1. I'll discuss caller and callee in the next example - the one that has a function with parameters. ncalls, which gives the number of calls made by this function (useful to identify the innermost function and treat it differently in various optimization schemes and even in code generation), is 0, since foo doesn't call any other function.

Next comes the code list, which is a list of interfaces.

codelist = interface list Blockbeg : interface Local : interface uid = 4 p = symbol id = p type = 3 scope = LOCAL sclass = AUTO ref = 10000 flags = 0 Forest : interface nodes = node list ASGN : node suffix = I size = 4 left = ADDRL : node suffix = P size = 4 uid = 4 right = CNST : node suffix = I size = 4 value = 7 len = 4 align = 4 Blockbeg : interface Local : interface uid = 6 p = symbol id = p type = 5 scope = LOCAL+1 sclass = AUTO ref = 10000 flags = 0 Forest : interface nodes = node list ASGN : node suffix = I size = 1 left = ADDRL : node suffix = P size = 4 uid = 6 right = Unary : node suffix = I size = 1 op = INDIR left = ADDRL : node suffix = P size = 4 uid = 6 len = 1 align = 1 Blockend : interface Blockend : interface Forest : interface nodes = node list LABEL : node suffix = V size = 0 label = 1

Lets now in brief go over every interface in this code list.

The Blockbeg interface tells the backend that this is the beginning of a block.
Next is the local interface, which has the symbol entry for the outer p.
The forest interface comes next, and it represents the code for initializing the outer p to the value 7.
This is followed by one more Blockbeg interface to represent the beginning of the inner block.
Next is the local symbol table for this inner block, which contains the symbol entry for the inner p.
Now the forest interface for the code which will assign the value of the outer p (after conversion to char) to the inner p.
This is followed by the Blockend interface that signals the end of the inner block.
Then comes the Blockend for the outer block.
The last Forest interface, just declares a variable. This is to be a label to the exit code of the function. This label at the end of the function and a branch to it are needed when there is a specific instruction in the end performing RET. An example of this could be i/o register switching in SPARC. It is preferable to have a single copy of this instruction - so we need a branch. .

However, if you look at the assembly code generated for this, you don't see any corresponding code for Blockbeg and Blockend. But these are nevertheless faithfully represented by the frontend IR, and are used by the backend for book keeping - it just has no line of assembly code that corresponds to it. Here's the assembly code for the function foo. _TEXT segment _foo: push ebx push esi push edi push ebp mov ebp,esp sub esp,8 mov dword ptr (-4)[ebp],7 mov byte ptr (-5)[ebp],97 L1: mov esp,ebp pop ebp pop edi pop esi pop ebx ret _TEXT ends

Notice the pushes at the function beginning and the pop at the end - this is due to the function interface. Look at how mov ebp,esp sub esp,8
makes space on stack for the two variables called p (not for n because its static) for this activation of the function.

Well now lets look at the Forest interface and the representation for instructions. The line p = 7, is translated as, ASGNI (ADDRLP p) (CNSTI 7)

ASGN stands for assign, ADDRL for address of local, and CNST for constant. In ASGNI, ADDRLP, and CNSTI the suffixes I, P and I represent the type information for these instructions. I is for integer, P is for pointer (an address of a local variable will of course have the type pointer). The len field for ASGN, I think, gives the number of bytes that should be copied in this assignment. Its 4 in this case.

The line "p=p" is translated as ASGNI (ADDRLP p) (INDIRI (ADDRLP p))

But wait! the uid for both these variables p are same! Well, this is correct because both these variables p in "p=p", refer to the same p - the inner one. So if the programmer ever meant to assign the value of the outer p to the inner one, he'll be disappointed. Our compiler faithfully follows the rules of the C language.

Notice how the value of p is used by doing an INDIR on the ADDRL of p. Also, in INDIRI the size is 1, so the byte value will be used. This is because p is a char.

Well, thanks to this example, we now know a bit of how functions, local variables, and some simple code is represented.

Example#5: Function with parameters and arithmetic

The C code
The generated IR
The corresponding assembly

This example tells us a lot of things. We see how functions with parameters get represented. We see some arithmetic being done, and also some things about what callee and caller are in the function interface.

The C input



foo(int a, float b)

{

     return (a+b)/b;

}

Lets look at how the type of this function is represented in the item list. Type : item uid = 10 type = FUNCTION : type size = 0 align = 0 type = 2 formals = int list2, 4

The "type" filed here holds uid 2 which is for INT. Also the parameter types are in formals, and they contain UIDs 2 and 4. As mentioned, 2 is INT, 4 is FLOAT but has size 8. This is because the C standard says that when a parameter is declared as a float, its passed along as a double. This incidentally also explains the need for two sets of symbols in the symbol table called the callee list and the caller list. You can see them in the Function interface: Function(int f,int* caller,int* callee,int ncalls,interface* codelist)

The caller symbol for float b in this example will have type FLOAT and size 8, where as the corresponding callee symbol b will have type FLOAT and size 4. This is because the code in the function sees it asa float, but it has to be passed as a double. Thats why, in the beginning of the function, in the code list, there'll be code to copy the caller values into the callee values. Of course, more often than not, the caller types and callee types are exactly identical, but we just saw one of the exceptions. There are other cases where they might differ. For details, refer to the paper[1] I've mentioned in the introduction.

The symbols for these parameters are defined by the Local interface. But this local interface is not part of the list of interfaces called "codelist" in the Function interface. The local interfaces for parameters come up at the top level in the interface list in "program". The Local interface and symbol tables for local variables (as opposed to function parameters) come in the codelist interface list in the Function interface.

Another interesting in this case is that lcc considers the b in (a+b)/b to be a common subexpression. So it'll create a CSE node, to store this value in a temporary location. Therefore, we'll see this temporary symbol in the Local interface in codelist in the Function interface. Local : interface uid = 9 p = symbol id = 1 type = 7 scope = LOCAL sclass = REGISTER ref = 10000 flags = temporary|generated

The id here is 1. It'll be 1, 2, 3... for all the internally generated temporary variables. Temporary variables are created when common subexpressions are detected. They go hand-in-hand with the CSE node in the node list. Other interesting things are the REGISTER in sclass, and the flags field. I am not however sure what flags and ref stand for exactly.

The next goodie about this example is the code. I won't give code representation here - its too long. You can see it in the file. What I give here is a short lisp-like representation of the forest of nodes. I have written the name T1, in places where our temporary variable is used.

The Forest:


 ASGNF (ADDRLP T1) (CSE T1 (INDIRF (ADDRFP b)))
 RET (CVFI (DIVF 

(ADDF (CVIF (INDIRI (ADDRFP a))) (INDIRF (ADDRLP T1))

(INDIRF (ADDRLP T1)))))

The code, as such, is quite straight forward. Notice that b is recognized as a common subexpression (the CSE node) and has been put into the generated temporary T1. ADDRL (address of local) is used for T1, and ADDRF (address of formal) is used for a and b. a is converted to a float, the calculation is done, and then the result is converted back to integer and is then returned as an integer.

Thats all about this example input. Its been pretty instructive.

Example#6: Pointer and array

The C code
The generated IR
The corresponding assembly

This is a small example to see how pointer and array access are compiled.

The C input



void foo(int *p)

{

     int q[10], i;

     q[i] = *p++;

}

Really speaking there's not much new in this example. We already know how functions, formal parameters and local variables are represented. We also know how expressions are represented - just the operators are different this time. But well, lets have a look at these different things.

The array type is represented like this: Type : item uid = 6 type = ARRAY : type size = 40 align = 4 type = 3

There's nothing else much of note. Lets have a look at the codelist generated. Again I give it in simple LISP like notation:


 ASGNP (ADDRL T1) (INDIRP (ADDRFP p)) 
 ASGNP (ADDRFP p) (ADDP (INDIRP (ADDRLP T1))  (CNSTI 4)) 
 ASGNP (ADDP (LSHI (INDIRI (ADDRLP i)) (CNSTI 2))) (INDIRI (INDIRP (ADDRLP T1)))

Look at the generated variable T1. This time it wasn't generated for CSE but to store the value in of the pointer p. This is because, I think, when confronted with a C assignment expression, lcc chooses to finish everything on the right hand side, before looking at the left. So it chooses to increment p (due to the postfix increment ++ operator), before assigning *p++ to q[i]. So it generates a temporary variable T1, to hold the value of p. The dereferencing to get the value stored at *p, will be now done by dereferencing T1. One might ask why is the pointer p stored in T1, why not directly stored the value *p instead? Well, this can be done in our example, because p is an integer pointer. But if it were a pointer to a large structure, the entire structure would have to be saved in a temporary variable! Thats why, I think, lcc chooses to save the pointer value.

Another interesting thing is the LSH operator. Since q is an array of integers, and one integer occupies 4 bytes, q[i] is at byte offset (q + 4 * i). This 4 * i is efficiently achieved by LSH 2. But then again look at the assemble generated for this (for win32 x86): mov dword ptr (-40)[ebp][esi*4],edi
Well, this is how thhe backend chooses to generate code.

Example#7: Function call

The C code
The generated IR
The corresponding assembly

This is an example to see how function calls are represented.

The C input



struct t_s

{

     int p, q;

};



extern struct t_s extStruct(int a, int b, int c);

extern int extInt(strcut t_s s);



int foo(int a, int b)

{

	  struct t_s s;

     return extStruct (a, b, b).q + extInt(s);

}

The second and third parameter in the call to extStruct above are same - b. This is deliberate. Lets see if this causes lcc to generate a temporary variable. Also, we have purposely called two functions - one that returns a compund type struct, and another that returns an int. Also the second one takes that compund type struct t_s. We want to know whether there's any difference in how such calls are represented. Lets straight away turn to the code list. In the code list we can learn how arguments are passed to other functions in a function call. and how returned values are handled.


ASGNI (ADDRLP T2) (CSE T2 (INDIRI (ADDRFP b))) 
 ARGI (INDIRI (ADDRLP T2))
 ARGI (INDIRI (ADDRLP T2))
 ARGI (INDIRI (ADDRFP a))
 ARGP (ADDRLP T1))
 CALLV (ADDRGP extStruct)
 ARGB (INDIRB (ADDRLP s))
 ASGNI (ADDRLP T3) (CSE T3 (CALLI (ADDRGP extInt))) 
 RETI (ADDI (INDIRI (ADDRLP A2)) (INDIRI (ADDRLP T3))

The code list above, warrants quite a bit of explanation. Looking at the call to extStruct, we see that lcc has indeed generated a temporary variable T2 to hold the value b. Next we see the use of ARG to represent the passing of arguments. The first three args push T2 (which holds the value held by b), T2, and a. This is as we expect. The next ARGP is interesting. Since extStruct returns a structure and not some simple type like int, the return value has to be held on stack. Therefore we have T1 - a generated temporary (using the Local interface) of type struct t_s. The address of T1 is passed to extStruct. The caller foo expects callee extStruct to dereference this pointer, and fill the area with the return value. Hence, after the CALLV (call to a void function), foo expects T1 to be filled with the return value. So we see that functions that return compund types are treated like functions that return nothing (void), but have an extra argument which is a pointer to the area where the return value is to be filled in.

Next is the call to extInt. extInt takes a structure and returns an integer. The passing of the structure is done using "ARGB s". This is nice and simple. The call this time is a call to a function returning an integer and henceis CALLI. The return value is held in a generated temporary T3 of type int.

The next interesting thing is A2 in the last tree (with root RETI). This is the first example of the use of the "Address" interface. Address : interface uid = 12 q = symbol id = 2 type = 4 scope = LOCAL sclass = AUTO ref = 10000 flags = computed|temporary|generated p = 11 n = 4

I am not entirely sure of every field. uid is the uid of this address (the backend will choose to store this address in a variable). q is the symbol that this address represents. The id here is 2 (not sure why not 1), and thats why I call it A2. The flags this time is "computed|temporary|generated" - the computed has come the first time. This is because this address has to be computed at run time. p is the symbol this address refers to. In our example this symbol (uid 11) is T1. T1 was that temporary variable of type struct t_s, a pointer to whom was passed to extStruct. Hence the return value of extStruct is in T1. Since for A2 n = 4, we see that A2 represents the integer at offset 4 from T1. Which, in the C code, is the field q in the value returned from extStruct.

Now if you look at the RETI tree this value is added to the value return by extInt, and this is what is desired.

Example#8: if-else if-else

The C code
The generated IR
The corresponding assembly

This is a small example to see the if-else if-else construct is represented.

The C input



int compare(int a, int b)

{

     if(a < b)

          return -1;

     else if(a == b)

          return 0;

     else

          return 1;

}

Okay, straight to the code list.


 GE ( INDIRI (ADDRFP a)) (INDIRI (ADDRFP b)) L2
 RETI (CNSTI -1)
 BRANCH L1
 LABEL L2 
 NE (INDIRI (ADDRFP a)) (INDIRI (ADDRFP b)) L4
 RETI (CNSTI 0) 
 BRANCH L1 
 LABEL L4 
 RETI (CNSTI 1) 
 LABEL L1

I have taken some liberties to make the list more readable. But this list is pretty much it. This really simple and almost self-explanatory.

Example#9: switch

The C code
The generated IR
The corresponding assembly

This is a small example to see the switch statement.

The C input



char foo(int a, int b)

{

     char ret;

     switch(compare(a, b))

     {

     case 1:

          ret = 'G';

          break;

               

     case 0:

          ret = 'E';

          break;

     

     case -1:

          ret = 'L';

          break;



     default:

          error("Should never get here\n");



     }

     return ret;

}

Okay, straight to the code list.


 ARGI (INDIRI (ADDRFP b)) 
 ARGI (INDIRI (ADDRFP a)) 
 ASGNI (ADDRLP T1) (CSE T1 (CALLI (ADDRGP compare))) 
 ASGNI (ADDRLP T4) (INDIRI (ADDRLP T1)) 
 ASGNI (ADDRLP T2) (CSE T2 (INDIRI (ADDRLP T4))) 
 EQ (INDIRI (ADDRLP T2)) (CNSTI -1) L7
 EQ (INDIRI (ADDRLP T2)) (CNSTI 0) L6
 EQ (INDIRI (ADDRLP T2)) (CNSTI 1) L5
 BRANCH L2
 LABEL L5 
 ASGNI (ADDRLP ret) (CNSTI 71) 
 BRANCH L3 
 LABEL L6 
 ASGNI (ADDRLP ret) (CNSTI 69) 
 BRANCH L3 
 LABEL L7 
 ASGNI (ADDRLP ret) (CNSTI 76) 
 BRANCH L3 
 LABEL L2 
 ARGP (ADDRGP G8) 
 CALLI (ADDRGP error) 
 LABEL L3 
 RETI (CVII (INDIRI(ADDRLP ret)))

Though this looks quite big there's nothing very complicated here. Observe that in a switch statement, all comparisons are next to each other in the code list. And so are all the actions for each case. Not also that (though its not very clear from the code list above), though the return type of foo is char, the char ret, was converted from INT of size 1, to an INT of size 4. Also, the three temporary variables T1, T4 and T2 seem unnecessary. You can see the redundant assignments after calling compare. I think this is just the result of the syntax directed translation scheme of lcc. I am not sure though. Another reason could be that the return value in most machines is typically stored in a fixed register. For x86-win32 its eax. Now the first of these seemingly redundant assignments maybe a way to have the value go out of this register to some temporary variable. Also, maybe at the beginning of any switch statement, the value being switched on is put into some register for better performance. This is what the second assignment could be for. But of course, one would think that the code generator back-end could easily do all this. Well, I am not too sure.

The G8 passed to error above refers to the literal string. The uid for that is 8, and I call it G8.

Thats all about the switch statement.

Example#10: while loop

The C code
The generated IR
The corresponding assembly

This is a small example to see how while loops are represented.

The C input



int sum10()

{

     int sum = 0;

     int i = 1;

     while(i <= 10)

     {

          sum += i;

 	       i++;

     }

     return sum;

}

Once again, there's not much of note here. So straight to the code list.


 ASGNI (ADDRLP sum) (CNSTI 0) 
 ASGNI (ADDRLP i) (CNSTI 1) 
 BRANCH L3 
 LABEL L2 
 ASGNI (ADDRLP sum) (ADD (INDIRI (ADDRLP sum)) (INDIRI (ADDRLP i)) 
 ASGNI (ADDRLP i) (ADDI (INDIRI (ADDRLP i)) (CNSTI 1))
 LABEL L3 
 LE (INDIR (ADDRLP i)) (CNSTI 10) L2 
 RETI (INDIRI (ADDRLP sum))

The code in itself is straight forward. If you look at the symbol declarations using Local interace for i and sum, they have been given storage class REGISTER.

Example#11: do loop

The C code
The generated IR
The corresponding assembly

This is a small example to see how do loops are represented.

The C input



int sum10()

{

     int sum = 0;

     int i = 1;



     do

     {

          sum += i;

 	       i++;

     }while(i <= 10);



     return sum;

}

The code list:


 ASGNI (ADDRLP sum) (CNSTI 0) 
 ASGNI (ADDRLP i) (CNSTI 1) 
 LABEL L2 

 ASGNI (ADDRLP sum) (ADD (INDIRI (ADDRLP sum)) (INDIRI (ADDRLP i)) 
 ASGNI (ADDRLP i) (ADDI (INDIRI (ADDRLP i)) (CNSTI 1))
 LABEL L3 
 LE (INDIR (ADDRLP i)) (CNSTI 10) L2 
 RETI (INDIRI (ADDRLP sum))

The code in itself is straight forward. If you look at the symbol declarations using Local interace for i and sum, they have been given storage class REGISTER. Also, note that the code is very much like that of while loop, except, of course, for that BRANCH node in the beginning. Even the label L3, thoug unnecessary seems to be there.

Example#12: for loop

The C code
The generated IR
The corresponding assembly

This is a small example to see how for loops are represented.

The C input



int sum10()

{

     int sum = 0;

     int i;

     for(i = 1; i <= 10; i++)

    {

          sum += i;

    }

     return sum;

}

The code list:


 ASGNI (ADDRLP sum) (CNSTI 0) 
 ASGNI (ADDRLP i) (CNSTI 1) 
 LABEL L2 
 ASGNI (ADDRLP sum) (ADDI (INDIRI (ADDRLP sum)) (INDIRI (ADDRLP i)) 
 LABEL L3 
 ASGNI (ADDRLP i) (ADDI (INDIRI (ADDRLP i)) (CNSTI 1))
 LE (INDIR (ADDRLP i)) (CNSTI 10) L2 
 RETI (INDIRI (ADDRLP sum))

Once again, this is very similar to do and while loops, the code list is pretty straight forward. Also, i and sum are both declared as REGISTER variables.

All in all...

We've seen lcc and the way it generates the intermediate representation. The ASDL support is really helpful for anybody who wants to play around with program analysis or transformations. All these examples, I hope, will give one a feel of the intermediate code, which is very necessary for anybody to cook up really cool optimization schemes. The various examples show that this intermediate code is not very huge or complicated, and might perhaps be nice to work with.

If there are any corrections or suggestions please let me know. Any kind of feedback is very welcome.
If you have any links or pointers to anything which you think might be of interest, please let me know.

Things still unknown and undone

How are the "ref" and "flags" field of symbol used by the lcc backend?
Why in example 7, the id of the address is 2 and not 1

lcc with ASDL and the IR it generates

Contents:

What is this all about

lcc's ASDL file

Examples

Example#1: An unitialized global integer

The C input

The Generated IR

The corresponding assembly code

Example#2: Initialized global variables

The C input

Example#3: Array, structure and union

The C input

Example#4: Local variables

The C input

Example#5: Function with parameters and arithmetic

The C input

Example#6: Pointer and array

The C input

Example#7: Function call

The C input

Example#8: if-else if-else

The C input

Example#9: switch

The C input

Example#10: while loop

The C input

Example#11: do loop

The C input

Example#12: for loop

The C input

All in all...

Things still unknown and undone